Goal:

In this report, we analyze 1,000 hotels across the US and customers’ reviews of each hotel. With this data, we address three main questions.

  1. Which states have a relatively developed tourism industry?

This question can be answered by looking at the number of hotels in each state. Tourism and the hotel industry are closely related: major growth in tourism leads to the development of supporting infrastructure such as hotels. Thus, in general, a state with a large number of hotels is likely to have a relatively developed tourism industry.

  2. Which states have a relatively higher average hotel rating?

This information is helpful for tourists who care about the quality of the hotels they stay in when they travel. It gives them better insight when deciding which state to visit.

  3. What kind of sentiment words can lead to a higher hotel rating?

Hotel rating is an important metric for hotel owners and managers because a higher rating can attract more guests and, in turn, bring in more profit. By studying the sentiment of the words used in reviews, owners and managers can make a more cost- and time-efficient plan to encourage guests to write reviews containing the sentiment words that are associated with higher hotel ratings.

1) Which states have a relatively developed tourism industry?

From the map, we can tell that California, Texas, Florida, and New York have the most hotels. Among them, California has the most, with 351 hotels. The state with the fewest hotels is Alaska, which has only 14.

We think that hotel count alone is not informative enough to identify which states are developed in tourism, so we defined a variable, Tourism Development Degree, calculated as the hotel count divided by the population density of the state. This takes both the state’s population and its area (square miles) into account. The darker the color, the more developed the state is in tourism. From the map, we can tell that Alaska has the highest degree of development because it has 14 hotels while its population density is very low. Interestingly, California does not appear very developed in tourism because of its high population density.
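The calculation of Tourism Development Degree can be sketched in a few lines of base R. This is a minimal illustration, not the code used for the map: the hotel counts for California and Alaska are taken from the text, while the population density values and all column names are assumptions chosen only to make the example run.

```r
# Hotel counts come from the report; the density values (people per square
# mile) are rough placeholders, NOT the data used in the report.
states <- data.frame(
  state       = c("California", "Alaska"),
  hotel_count = c(351, 14),
  pop_density = c(250, 1.3)  # assumed, illustrative only
)

# Tourism Development Degree = hotel count / population density
states$tourism_degree <- states$hotel_count / states$pop_density

# Rank states from most to least developed under this measure
states[order(-states$tourism_degree), ]
```

Because population density is population divided by area, this measure is equivalent to hotel count times area divided by population, so large, sparsely populated states score highly even with few hotels.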

2) Which states have a relatively higher average hotel rating?

When it comes to vacations, most people think about Florida, the California coast, or Las Vegas. These famous destinations generally have many hotels, but are their hotels rated as highly as the destinations are popular? With this question in mind, we calculated the average rating of all hotels in each state and visualized the top ten states. New Mexico has the highest average hotel rating with 4.41 points, followed by Alabama with 4.36 points and Utah with 3.56 points. These states are not well-known tourist destinations, but their average scores are high. As this graph only shows the top ten states and carries no geographical information, we did further analysis.
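The state-level averaging described above could look roughly like the following base R sketch. The data frame, its column names, and the ratings are hypothetical, and averaging per hotel before averaging per state is an assumption about how "the average rating of all hotels in each state" was computed.

```r
# Hypothetical review-level data; column names and values are assumptions
reviews <- data.frame(
  hotel  = c("h1", "h1", "h2", "h3", "h3", "h4"),
  state  = c("NM", "NM", "NM", "CA", "CA", "CA"),
  rating = c(5, 4, 4, 3, 4, 4)
)

# Step 1: average rating of each hotel
hotel_avg <- aggregate(rating ~ hotel + state, data = reviews, FUN = mean)

# Step 2: average of the hotel averages within each state
state_avg <- aggregate(rating ~ state, data = hotel_avg, FUN = mean)

# Step 3: sort from highest to lowest and keep at most ten states
state_avg  <- state_avg[order(-state_avg$rating), ]
top_states <- head(state_avg, 10)
top_states
```

Averaging review ratings directly per state would weight hotels with many reviews more heavily, so the two approaches can rank states differently.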

Here is the map of the distribution of all the hotels in the dataset. From the map, we can see that most of the hotels are located in the mainland United States. Zooming in on the map, we can see that the biggest hotel clusters are in some of the most populated states, such as California, Florida, and New York. Hawaii is small in area compared to other locations, but 30 hotels were sampled from there. Moreover, the hotels are divided into good and bad hotels based on the average rating of all reviews: a hotel whose rating is below the average is considered a bad hotel, and a hotel whose rating is above the average is considered a good hotel.
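The good/bad split can be illustrated with a short base R sketch. The hotel names and ratings are hypothetical, the threshold here is the mean of the hotel averages (the report's threshold may have been computed over individual reviews), and labeling a hotel that sits exactly on the average as "bad" is an assumption, since the text does not say how ties are handled.

```r
# Hypothetical hotel-level average ratings
hotels <- data.frame(
  hotel      = c("A", "B", "C", "D"),
  avg_rating = c(4.6, 3.2, 4.1, 2.9)
)

# Threshold: overall average rating (here the mean of the hotel averages)
overall_mean <- mean(hotels$avg_rating)

# Above the average is "good"; everything else is "bad" (ties are an assumption)
hotels$quality <- ifelse(hotels$avg_rating > overall_mean, "good", "bad")
hotels
```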

We display the geographic location of each state; the darker the color, the higher the average score. This result is consistent with the previous chart of the top ten states. We can see that the central United States and New Mexico are indeed highly rated. However, the previous map shows that there are not many hotels in these places. Because these states have few hotels and the quality of those hotels is high, their average rating is relatively high. In fact, if we look at New York and its surroundings, we find that there are more hotels there and their ratings are also high. So New York may be a good choice if the quality of accommodation matters to you.

3) What kind of sentiment words can lead to a higher hotel rating?

According to the sentiment analysis, trust is the most frequent emotion in hotel reviews and joy is the second most frequent. Both are positive emotions, which is consistent with the hotel rating score, which is above 4.0 on average.

1. Sentiment Analysis (Positive & Negative)

After pre-processing the text data in the hotel review column, we first broke all the reviews into single words (unigrams). Then, we matched this bag of words against the bing dictionary we imported to make the comparison word cloud. This gives us an overview of the positive and negative words people used in their hotel reviews.
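The tokenize-then-match step can be sketched in base R as follows. This is only an illustration: the two reviews are invented, and the four-word lexicon is a hypothetical stand-in for the bing dictionary, not an excerpt from it.

```r
# Two invented reviews
reviews <- c("The room was clean and the staff were friendly",
             "Dirty bathroom and a noisy, bad night")

# Tiny stand-in lexicon (hypothetical), NOT the real bing word list
lexicon <- data.frame(
  word      = c("clean", "friendly", "dirty", "bad"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Lower-case the text and split on anything that is not a letter or apostrophe
tokens <- unlist(strsplit(tolower(reviews), "[^a-z']+"))
tokens <- tokens[tokens != ""]

# Keep only the unigrams that appear in the lexicon (inner join)
matched <- merge(data.frame(word = tokens), lexicon, by = "word")

# Word-by-sentiment counts, the kind of matrix a comparison word cloud is drawn from
word_counts <- table(matched$word, matched$sentiment)
word_counts
```

Words that are not in the lexicon, such as "noisy" here, are dropped by the join, so the result depends heavily on the coverage of the dictionary.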

2. Sentiment Analysis (Emotions)

This plot shows the coefficient of each emotion. Based on the graph, surprise is the most positive emotion: each additional surprise-related word in a review is expected to increase the rating by about 0.02 out of 5. On the other hand, disgust is the most negative emotion: each additional disgust-related word is expected to decrease the rating by about 0.04 out of 5.

summary(model_nrc)
## 
## Call:
## lm(formula = x ~ ., data = nrc_a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0125 -0.4869  0.1308  0.9710  2.9585 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             3.960474   0.022077 179.394  < 2e-16 ***
## sentiment_anger         0.017298   0.008278   2.090 0.036736 *  
## sentiment_anticipation -0.011829   0.003236  -3.655 0.000263 ***
## sentiment_disgust      -0.043519   0.008848  -4.918 9.28e-07 ***
## sentiment_fear          0.004288   0.010563   0.406 0.684818    
## sentiment_joy           0.008502   0.002932   2.899 0.003772 ** 
## sentiment_sadness      -0.011507   0.008743  -1.316 0.188273    
## sentiment_surprise      0.019361   0.005679   3.409 0.000662 ***
## sentiment_trust        -0.001614   0.002891  -0.558 0.576733    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.036 on 2541 degrees of freedom
## Multiple R-squared:  0.04133,    Adjusted R-squared:  0.03831 
## F-statistic: 13.69 on 8 and 2541 DF,  p-value: < 2.2e-16

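Based on the model result, disgust and surprise are significant predictors of hotel ratings, along with anticipation and joy; anger is also significant at the 5% level. Note that the model explains only about 4% of the variance in ratings (R-squared of 0.041), so these effects are small.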
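To make the coefficients concrete, the sketch below applies the fitted equation to a single review. The coefficients are copied from the `summary(model_nrc)` output above; the emotion word counts for the review are hypothetical.

```r
# Coefficients copied from the summary(model_nrc) output
coefs <- c(
  intercept = 3.960474, anger = 0.017298, anticipation = -0.011829,
  disgust = -0.043519, fear = 0.004288, joy = 0.008502,
  sadness = -0.011507, surprise = 0.019361, trust = -0.001614
)

# Hypothetical review: number of words matched to each emotion
counts <- c(anger = 0, anticipation = 1, disgust = 2, fear = 0,
            joy = 3, sadness = 0, surprise = 1, trust = 4)

# Predicted rating = intercept + sum(coefficient * word count)
predicted <- coefs[["intercept"]] + sum(coefs[names(counts)] * counts)
round(predicted, 2)  # about 3.90
```

Even with two disgust-related words, the prediction moves less than 0.1 away from the intercept, which illustrates how small the per-word effects are.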

3. Sentiment Analysis (By Emotional Words)

After using the bing dictionary, we turned to sentiment analysis with the NRC dictionary. NRC assigns each word to one or more of eight primary emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and to a positive or negative sentiment. This visualization offers a new angle on the sentiment of people’s reviews. As shown above, positive sentiment is strongly related to words such as “clean”, “good”, “breakfast”, “friendly”, and “helpful”, while negative sentiment is strongly related to words such as “small”, “bad”, “noise”, “late”, and “dirty”. In addition, some of the words with positive sentiment are also related to several positive emotions. For instance, “good” is assigned the emotions of anticipation, joy, surprise, and trust; “friendly” is related to anticipation, joy, and trust; and “clean” is related to joy and trust.

4. Sentiment Analysis: Score Distribution

After using the NRC dictionary, we turned to the Afinn dictionary, which assigns every word a score from -5 to +5. The more negative the score, the more negative the word, and vice versa. The bar plot shown above conveys that most of the words people used in their hotel reviews have a score of 2 or 3, which is only moderately positive by Afinn’s standard. Similarly, on the negative side most words have a score of -1 or -2.
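The score distribution behind such a bar plot can be sketched in base R as follows. The token list is invented and the scored lexicon is a hypothetical stand-in that only mimics Afinn's -5 to +5 integer scale; the values should not be read as the real Afinn scores.

```r
# Stand-in scored lexicon (hypothetical values on the -5..+5 scale)
lexicon <- data.frame(
  word  = c("good", "great", "clean", "bad", "dirty", "noise"),
  value = c(3, 3, 2, -3, -2, -1)
)

# Invented unigrams from pre-processed reviews
tokens <- c("good", "clean", "good", "bad", "great", "dirty", "room", "clean")

# Attach a score to every token found in the lexicon
scored <- merge(data.frame(word = tokens), lexicon, by = "word")

# Count words at each score, keeping empty scores so the full scale is shown
score_dist <- table(factor(scored$value, levels = -5:5))
score_dist

# barplot(score_dist) would draw the distribution
```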